Every two years, the Centers for Disease Control and Prevention (CDC) conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, which collects data from high schoolers (9th through 12th grade) to analyze health patterns. We will work with a selected group of variables from a random sample of observations from one of the years the YRBSS was conducted.
This dataset is part of the openintro package, so we can load and inspect it directly. There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file: ?yrbss
## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, …
## $ gender <chr> "female", "female", "female", "female", "fem…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9",…
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not…
## $ race <chr> "Black or African American", "Black or Afric…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, …
## $ weight <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2,…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not…
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did no…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7,…
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+"…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7,…
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5…
skimr::skim(yrbss) # get a feel for missing values, summary statistics of numerical variables, and a very rough histogram
| Name | yrbss |
| Number of rows | 13583 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 12 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| grade | 79 | 0.99 | 1 | 5 | 0 | 5 | 0 |
| hispanic | 231 | 0.98 | 3 | 8 | 0 | 2 | 0 |
| race | 2805 | 0.79 | 5 | 41 | 0 | 5 | 0 |
| helmet_12m | 311 | 0.98 | 5 | 12 | 0 | 6 | 0 |
| text_while_driving_30d | 918 | 0.93 | 1 | 13 | 0 | 8 | 0 |
| hours_tv_per_school_day | 338 | 0.98 | 1 | 12 | 0 | 7 | 0 |
| school_night_hours_sleep | 1248 | 0.91 | 1 | 3 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 77 | 0.99 | 16.16 | 1.26 | 12.00 | 15.0 | 16.00 | 17.00 | 18.00 | ▁▂▅▅▇ |
| height | 1004 | 0.93 | 1.69 | 0.10 | 1.27 | 1.6 | 1.68 | 1.78 | 2.11 | ▁▅▇▃▁ |
| weight | 1004 | 0.93 | 67.91 | 16.90 | 29.94 | 56.2 | 64.41 | 76.20 | 180.99 | ▆▇▂▁▁ |
| physically_active_7d | 273 | 0.98 | 3.90 | 2.56 | 0.00 | 2.0 | 4.00 | 7.00 | 7.00 | ▆▂▅▃▇ |
| strength_training_7d | 1176 | 0.91 | 2.95 | 2.58 | 0.00 | 0.0 | 3.00 | 5.00 | 7.00 | ▇▂▅▂▅ |
The variable weight records the weight of participants in kilograms.
1. From skimr::skim() above, we can see there are 1004 observations missing in weight.
2. Using visualization and summary statistics to describe the distribution of weights:
yrbss %>%
summarise(mean_weight = mean(weight, na.rm=TRUE), #mean is 67.9065
median_weight=median(weight, na.rm = TRUE)) #median is 64.41
## # A tibble: 1 x 2
## mean_weight median_weight
## <dbl> <dbl>
## 1 67.9 64.4
yrbss %>%
filter(!is.na(weight)) %>% #remove the missing observations
#plotting the distribution
ggplot(mapping = aes(x=weight)) +
geom_density() +
#calculating the mean and median of weight and plotting them onto the distribution, mean = 67.91, median = 64.41
geom_vline(xintercept = mean(yrbss$weight,na.rm=TRUE), colour = "lightslateblue")+
geom_vline(xintercept = median(yrbss$weight, na.rm = TRUE), colour = "tomato") +
annotate("text", x = 80, y = 0.031, label = "Mean=67.91", colour = "lightslateblue")+
annotate("text", x = 50, y = 0.031, label = "Median=64.41", colour = "tomato")+
labs(x = "Weight in Kilograms",
y= "Density of observations",
title = "Right-Skewed Distribution of High Schoolers' Weights",
subtitle = "Most high schoolers' weights are concentrated around 60kg \n however a significant number far above that average")+
theme_economist_white()+
NULL
3. Possible relationship between a high schooler’s weight and their physical activity
First, we create a new variable, physical_3plus, which is yes if they are physically active for at least 3 days a week, and no otherwise.
About 66.9% of high schoolers are physically active for at least 3 days a week.
yrbss <- yrbss %>%
mutate(physical_3plus = ifelse(physically_active_7d >= 3, "yes", "no"))
yrbss %>% filter(!is.na(physical_3plus)) %>%
group_by(physical_3plus) %>%
summarise(count = n()) %>%
mutate(prop= count/sum(count))
## # A tibble: 2 x 3
## physical_3plus count prop
## <chr> <int> <dbl>
## 1 no 4404 0.331
## 2 yes 8906 0.669
The 95% confidence interval for the population proportion of high schoolers who are NOT active 3 or more days per week is (0.32, 0.34).
yrbss %>%
filter(!is.na(physical_3plus)) %>%
group_by(physical_3plus) %>%
summarise(count = n()) %>%
mutate(
prop= count/sum(count),
SE_active3 = sqrt(prop*(1-prop)/sum(count)),
lower95_ci = prop + qnorm(0.025,0,1)*SE_active3,
upper95_ci = prop + qnorm(0.975,0,1)*SE_active3) %>%
filter(physical_3plus == "no")
## # A tibble: 1 x 6
## physical_3plus count prop SE_active3 lower95_ci upper95_ci
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 no 4404 0.331 0.00408 0.323 0.339
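As a cross-check on the hand-computed normal-approximation interval above, base R's prop.test() gives a very similar result. This is a sketch, with the counts (4404 "no" out of 13310 non-missing responses) taken from the summary table above:

```r
# Confidence interval for the proportion NOT active 3+ days per week;
# prop.test() uses a score-based interval rather than the Wald formula above,
# so the endpoints should be close to (0.323, 0.339) but not identical
prop.test(x = 4404, n = 13310, conf.level = 0.95)$conf.int
```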
Make a boxplot of physical_3plus vs. weight
yrbss %>%
#making a boxplot
filter(!is.na(physical_3plus), !is.na(weight)) %>%
ggplot(mapping = aes(x=weight, y = physical_3plus)) +
geom_boxplot() +
labs(x = "Weight in Kilograms",
y= "Physically active for at least 3 days a week?",
title = "Those who exercise turn out heavier than those who don't") +
NULL
We expected to see higher weights for those who exercise less, but the opposite seems to be the case. This might be because weight is not a perfect measure of fitness and can be associated with greater muscle mass, for example, or other characteristics of those who exercise. Another explanation could be that some of the people who responded yes in the survey might be exercising more often precisely to get in better shape. Furthermore, although the average is higher for those who exercise, those who do NOT exercise have higher variance and much larger outliers, which potentially represent those who are overweight or obese.
Boxplots show how the medians of the two distributions compare, but we can also compare the means of the distributions using either a confidence interval or a hypothesis test.
yrbss %>%
group_by(physical_3plus) %>%
filter(!is.na(physical_3plus)) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm=TRUE),
count = n(),
se_weight = sd_weight/sqrt(count),
t_critical = qt(0.975, count-1),
margin_of_error = t_critical * se_weight,
lower = mean_weight - t_critical * se_weight,
upper = mean_weight + t_critical * se_weight
)
## # A tibble: 2 x 9
## physical_3plus mean_weight sd_weight count se_weight t_critical
## <chr> <dbl> <dbl> <int> <dbl> <dbl>
## 1 no 66.7 17.6 4404 0.266 1.96
## 2 yes 68.4 16.5 8906 0.175 1.96
## # … with 3 more variables: margin_of_error <dbl>, lower <dbl>, upper <dbl>
There is an observed difference of about 1.77 kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap. The difference appears statistically significant at the 95% level. Let us also conduct a hypothesis test.
Write the null and alternative hypotheses for testing whether mean weights are different for those who exercise at least 3 times a week and those who don’t. Here, the null hypothesis is that the two population mean weights are equal (H0: mu_yes = mu_no), and the alternative is that they differ (H1: mu_yes != mu_no).
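The Welch two-sample t-test output below was presumably produced by a call along these lines (a sketch; it assumes the physical_3plus variable created earlier is present in yrbss):

```r
# Welch t-test of mean weight between the "yes" and "no" exercise groups;
# t.test() does not assume equal variances by default
t.test(weight ~ physical_3plus, data = yrbss)
```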
##
## Welch Two Sample t-test
##
## data: weight by physical_3plus
## t = -5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.42 -1.12
## sample estimates:
## mean in group no mean in group yes
## 66.7 68.4
The test result shows that the p-value is very low, smaller than 1%, so we reject the null hypothesis and conclude that the mean weights of the two groups (those who exercise at least 3 times a week and those who don’t) are different in a statistically significant way.
infer
Next, we will introduce a new function, hypothesize, that falls into the infer workflow. You will use this method for conducting hypothesis tests.
# initialize the test, which we save as `obs_diff`
obs_diff <- yrbss %>%
specify(weight ~ physical_3plus) %>%
# the statistic we are searching for is the difference in means, with the order being yes - no != 0
calculate(stat = "diff in means", order = c("yes", "no"))
# simulate the test on the null distribution, which we save as null
null_dist <- yrbss %>%
specify(weight ~ physical_3plus) %>%
# set the null hypothesis as a test for independence, i.e., that there is no difference between the two population means. In one sample cases, the null argument can be set to *point* to test a hypothesis relative to a point estimate
hypothesize(null = "independence") %>%
# the `type` argument within generate is set to permute, which is the argument when generating a null distribution for a hypothesis test
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
Visualize this null distribution
Visualising to see how many of these null permutations have a difference of at least the obs_stat of 1.77
Calculating the p-value for the hypothesis test using the function infer::get_p_value()
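A sketch of the visualisation and p-value calculation described above, using the obs_diff and null_dist objects already defined (the exact plotting options are assumptions):

```r
# overlay the observed difference on the permutation null distribution
null_dist %>%
  visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

# two-sided p-value: share of permuted differences at least as extreme as obs_diff
null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
```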
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
Above is the standard workflow for performing hypothesis tests. Note that a reported p-value of 0 simply means that none of the 1000 permuted differences was as extreme as the observed 1.77 kg; the true p-value is not literally zero, just very small.
Recall the IMDB ratings data in homework 1. We would like to explore whether the mean IMDB ratings for Steven Spielberg and Tim Burton are the same or not. Below we calculate the confidence intervals for the mean ratings of these two directors, and as you can see, they overlap. We will reproduce this graph and run a hypothesis test using both the t.test command and the infer package to simulate from a null distribution, where we assume zero difference between the two.
## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Aveng…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", …
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorro…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 2…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, …
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08,…
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08,…
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 92…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, …
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 3…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2,…
g1 <- Burton_Spielberg %>%
group_by(director) %>%
summarize(n = n(),
mean_rating = mean(rating, na.rm = TRUE),
SD_directors = sd(rating, na.rm = TRUE),
SE_directors = SD_directors/sqrt(n),
t_critical = qt(0.975, n-1),
lower95_ci = mean_rating - t_critical*SE_directors,
upper95_ci = mean_rating + t_critical*SE_directors)
g1 %>%
ggplot(mapping = aes(x = mean_rating, y= reorder(director, mean_rating)))+
geom_rect(aes(xmin = max(lower95_ci), xmax = min(upper95_ci), ymin = -Inf, ymax = Inf), alpha = 0.1, fill = "black") +
geom_point(aes(color = as.factor(director)), show.legend = FALSE, size = 5) +
geom_errorbar(aes(xmin = lower95_ci, xmax = upper95_ci, color = as.factor(director)), width = 0.1, show.legend = FALSE, size = 1.8) +
theme_minimal()+
labs(x = "Mean IMDB Rating",
y = "",
title = "Do Spielberg and Burton have the same mean IMDB rating?",
subtitle = "95% confidence intervals overlap"
)+
geom_text(aes(label = round(mean_rating,2)), vjust=-1, size=6)+
geom_text(aes(x=upper95_ci, label = round(upper95_ci,2)), vjust=-1, size=5)+
geom_text(aes(x=lower95_ci, label = round(lower95_ci,2)), vjust=-1, size=5)+
NULL
##
## Welch Two Sample t-test
##
## data: rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.16 1.13
## sample estimates:
## mean in group Steven Spielberg mean in group Tim Burton
## 7.57 6.93
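The Welch test output above was likely generated by a call like this (a sketch; Burton_Spielberg is assumed to be the movies data filtered to these two directors):

```r
# compare mean IMDB ratings between the two directors;
# Welch's test is the t.test() default (no equal-variance assumption)
t.test(rating ~ director, data = Burton_Spielberg)
```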
infer
# initialize the test, which we save as `directors_diff`
directors_diff <- Burton_Spielberg %>%
specify(rating ~ director) %>%
# the statistic we are searching for is the difference in means, with the order being "Steven Spielberg", "Tim Burton"
calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))
# simulate the test on the null distribution, which we save as null
directors_null_dist <- Burton_Spielberg %>%
specify(rating ~ director) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))
# Visualising to see how many of these null permutations have a difference
directors_null_dist %>% visualize() +
shade_p_value(obs_stat = directors_diff, direction = "two-sided")
# Calculating the p-value for the hypothesis test
directors_null_dist %>%
get_p_value(obs_stat = directors_diff, direction = "two_sided")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.03
From the hypothesis test, we can see that the t-stat is 3 and the p-value is about 1%, which is smaller than 5%, so we reject the null hypothesis and conclude that the difference in mean IMDB ratings for Steven Spielberg and Tim Burton is statistically significant; specifically, the mean rating of Steven Spielberg is higher. This is just what we expected, as we all prefer Steven Spielberg.
At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated against in the company, in the sense that salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (24 men and 26 women) revealed that the average salary for men was about $8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company salaries was warranted.
You are asked to carry out the analysis. The objective is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.
## Rows: 50
## Columns: 3
## $ salary <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 63…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "m…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, …
The data frame omega contains the salaries for the sample of 50 executives in the company. Can you conclude that there is a significant difference between the salaries of the male and female executives?
omega %>%
group_by(gender) %>%
summarise(mean_salary = mean(salary),
sd_salary = sd(salary),
count2 = n(),
se_salary = sd_salary/sqrt(count2),
t_critical2 = qt(0.975, count2-1),
margin_of_error2 = t_critical2*se_salary,
lower_ci = mean_salary - t_critical2*se_salary,
upper_ci = mean_salary + t_critical2*se_salary)
## # A tibble: 2 x 9
## gender mean_salary sd_salary count2 se_salary t_critical2 margin_of_error2
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 female 64543. 7567. 26 1484. 2.06 3056.
## 2 male 73239. 7463. 24 1523. 2.07 3151.
## # … with 2 more variables: lower_ci <dbl>, upper_ci <dbl>
There is an observed difference of about $8,696 in salary, and we notice that the two confidence intervals do not overlap. The difference appears statistically significant at the 95% level. Let us also conduct a hypothesis test.
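The Welch test output below was presumably produced by a call like this (a sketch, using the omega data frame loaded above):

```r
# Welch t-test of mean salary by gender
t.test(salary ~ gender, data = omega)
```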
##
## Welch Two Sample t-test
##
## data: salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12973 -4420
## sample estimates:
## mean in group female mean in group male
## 64543 73239
# hypothesis testing using infer package
# initialize the test, which we save as `salary_diff`
salary_diff <- omega %>%
specify(salary ~ gender) %>%
# the statistic we are searching for is the difference in means, with the order being "male", "female"
calculate(stat = "diff in means", order = c("male", "female"))
# simulate the test on the null distribution, which we save as null
salary_null_dist <- omega %>%
specify(salary ~ gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("male", "female"))
# Visualising to see how many of these null permutations have a difference
salary_null_dist %>% visualize() +
shade_p_value(obs_stat = salary_diff, direction = "two-sided")
# Calculating the p-value for the hypothesis test
salary_null_dist %>%
get_p_value(obs_stat = salary_diff, direction = "two_sided")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
The test result shows that the p-value is very low, smaller than 1%, so we reject the null hypothesis and conclude the difference in mean salaries between genders is statistically significant. Let us look at the correlation analysis.
# make gender a binary variable so that a correlation analysis is possible (male = 1, female = 0)
omega2 <- omega %>%
mutate(gender_num = ifelse(gender == "male",1,0))
cor(omega2$salary, omega2$gender_num, method = "pearson") # vector form; base R cor() does not accept a formula
## [1] 0.508
The correlation coefficient of 0.508 indicates that there is indeed a positive correlation between the gender of the executives and their salary. Let us now perform a linear regression.
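The regression output below can be reproduced with a call like this (a sketch; model_salary is a hypothetical object name):

```r
# simple linear regression of salary on the gender dummy;
# the gender_num coefficient equals the male-female difference in mean salary
model_salary <- lm(salary ~ gender_num, data = omega2)
model_salary          # coefficients only
summary(model_salary) # full table with standard errors, t-values, R-squared
```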
##
## Call:
## lm(formula = salary ~ gender_num, data = omega2)
##
## Coefficients:
## (Intercept) gender_num
## 64543 8696
##
## Call:
## lm(formula = salary ~ gender_num, data = omega2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18471 -4780 127 5484 14257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64543 1474 43.78 < 2e-16 ***
## gender_num 8696 2128 4.09 0.00017 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7520 on 48 degrees of freedom
## Multiple R-squared: 0.258, Adjusted R-squared: 0.243
## F-statistic: 16.7 on 1 and 48 DF, p-value: 0.000165
The results of our linear regression model unsurprisingly convey the same conclusion: a statistically significant difference in salaries between the genders, with the same difference of $8,696 observed in the confidence interval analysis.
In conclusion, we observed a significant gap in salaries between men and women at Omega. However, we cannot yet conclude that this gap is a result of gender discrimination, as there may be other factors, which we will investigate later, that have an impact on this relationship. Apart from logic, the other hint that our relationship may suffer from omitted variable bias is the low R-squared value of around 25%, which, although not a perfect measure, suggests that only 25% of the variation in salary can be explained by gender in this dataset.
At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons such as differences in experience. A questionnaire sent out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women only have about 7 years of experience on average (see table below).
## gender min Q1 median Q3 max mean sd n missing
## 1 female 0 0.25 3.0 14.0 29 7.38 8.51 26 0
## 2 male 1 15.75 19.5 31.2 44 21.12 10.92 24 0
Based on this evidence, can you conclude that there is a significant difference between the experience of the male and female executives? Perform similar analyses as in the previous section. Does your conclusion validate or endanger your conclusion about the difference in male and female salaries?
Firstly, let us look at the confidence intervals again
omega %>%
group_by(gender) %>%
summarise(mean_exp = mean(experience),
sd_exp = sd(experience),
count = n(),
se_exp = sd_exp/sqrt(count),
t_critical = qt(0.975, count-1),
margin_of_error = t_critical*se_exp,
lower_ci = mean_exp - t_critical*se_exp,
upper_ci = mean_exp + t_critical*se_exp)
## # A tibble: 2 x 9
## gender mean_exp sd_exp count se_exp t_critical margin_of_error lower_ci
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 female 7.38 8.51 26 1.67 2.06 3.44 3.95
## 2 male 21.1 10.9 24 2.23 2.07 4.61 16.5
## # … with 1 more variable: upper_ci <dbl>
As stated above, there is a difference of about 14 years in experience between male and female employees in our dataset, and we notice that the two confidence intervals do not overlap. The difference appears statistically significant at the 95% level. Let us also conduct a hypothesis test.
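The Welch test output below was presumably produced by a call like this (a sketch):

```r
# Welch t-test of mean years of experience by gender
t.test(experience ~ gender, data = omega)
```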
##
## Welch Two Sample t-test
##
## data: experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.35 -8.13
## sample estimates:
## mean in group female mean in group male
## 7.38 21.12
# hypothesis testing using infer package
# initialize the test, which we save as `exp_diff`
exp_diff <- omega %>%
specify(experience ~ gender) %>%
# the statistic we are searching for is the difference in means, with the order being "male", "female"
calculate(stat = "diff in means", order = c("male", "female"))
# simulate the test on the null distribution, which we save as null
exp_null_dist <- omega %>%
specify(experience ~ gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("male", "female"))
# Visualising to see how many of these null permutations have a difference
exp_null_dist %>% visualize() +
shade_p_value(obs_stat = exp_diff, direction = "two-sided")
# Calculating the p-value for the hypothesis test
exp_null_dist %>%
get_p_value(obs_stat = exp_diff, direction = "two_sided")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
The test result shows that the p-value is very low, smaller than 1%, so we reject H0 and conclude that mean years of experience differ between genders. Let us look at the correlation analysis.
## [1] 0.584
The correlation coefficient of 0.584 indicates that there is indeed a positive correlation between the gender of the executives and their years of experience. We also find that the correlation between experience and gender is higher than the correlation between salary and gender. Let us now perform a linear regression.
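The regression output below can be reproduced along these lines (a sketch; model_exp is a hypothetical name, reusing the gender_num dummy created in omega2 earlier):

```r
# regression of experience on the gender dummy;
# the gender_num coefficient equals the male-female gap in mean experience
model_exp <- lm(experience ~ gender_num, data = omega2)
summary(model_exp)
```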
##
## Call:
## lm(formula = experience ~ gender_num, data = omega2)
##
## Coefficients:
## (Intercept) gender_num
## 7.38 13.74
##
## Call:
## lm(formula = experience ~ gender_num, data = omega2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.12 -6.32 -2.25 8.37 22.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.38 1.91 3.87 0.00033 ***
## gender_num 13.74 2.76 4.98 8.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.74 on 48 degrees of freedom
## Multiple R-squared: 0.341, Adjusted R-squared: 0.327
## F-statistic: 24.8 on 1 and 48 DF, p-value: 8.51e-06
Looking at the results of our linear regression model, we can see that both p-values are well below 1% and therefore indicate a statistical significance of this regression model. However, both our R-squared and adj. R-squared values only amount to 0.341 and 0.327 respectively. In other words, only c. 34% of the variance for our dependent variable (experience) can be explained by our independent variable (gender).
Concluding, we can state that there is indeed a significant difference in years of experience between men and women in our dataset. Tying this conclusion to our previous findings about the relationship between salary and gender: although there is a difference in salaries between the genders, finding that there is also a significant difference in levels of experience potentially undermines the conclusion that the salary gap is a result of gender discrimination, as it may simply result from these differences in experience. However, to conclude that, we need to understand the relationship between salary and experience and see whether this relationship is the same between genders, i.e., that both genders are equally compensated for additional experience.
Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.
Analyse the relationship between salary and experience. Draw a scatterplot to visually inspect the data.
omega %>%
ggplot(mapping = aes(x= experience, y = salary, color = gender)) +
geom_point()+
labs(title = "Greater experience results in higher salaries",
subtitle = "Male executives at Omega are more experienced",
x = "Years of experience",
y = "Salary in USD")
## [1] 0.803
You can use GGally::ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order in which our variables will appear and have the dependent variable (Y), salary, last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make the plots somewhat transparent (alpha = 0.3).
omega %>%
select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
ggpairs(aes(colour=gender, alpha = 0.3))+
theme_bw()The scatterplot between the experience and and salary shows that there is 7 female executives that have just been hired into their role that have lower wages and a further 7 that have less than 5 years of experiences that also have lower wages than the more experienced male executives. It also shows that conversely the most experienced executives that have been there for 30+ years and that are among the best compensated in the company are exclusively male. Furthermore, from the right middle panel we gather that experience has a greater correlation with wages for female executives rather than their male counterparts, perhaps hinting at other factors that play a role in compensation, perhaps the size of team that each executive has or geographic coverage. In conclusion, when discussing the issue of pay discrimination at this company it seems that the disparity between wages comes from the fact that historically the company only had male executives. However now the company has made an effort to hire predominantly females into executive positions and that with time as the most experienced (male) executives start to retire and more importantly the generally less experienced female executives gain years of experience their compensation should converge, therefore the real test at Omega will be repeating this analysis in 10 years time
At the end of this challenge, we will produce this chart, which uses the FRED database to download historical yield curve rates and plot the yield curves since 1999.
# Get a list of FRED codes for US rates and US yield curve; choose monthly frequency
# to see, eg., the 3-month T-bill https://fred.stlouisfed.org/series/TB3MS
tickers <- c('TB3MS', # 3-month Treasury bill (or T-bill)
'TB6MS', # 6-month
'GS1', # 1-year
'GS2', # 2-year, etc....
'GS3',
'GS5',
'GS7',
'GS10',
'GS20',
'GS30') #.... all the way to the 30-year rate
# Turn FRED codes to human readable variables
myvars <- c('3-Month Treasury Bill',
'6-Month Treasury Bill',
'1-Year Treasury Rate',
'2-Year Treasury Rate',
'3-Year Treasury Rate',
'5-Year Treasury Rate',
'7-Year Treasury Rate',
'10-Year Treasury Rate',
'20-Year Treasury Rate',
'30-Year Treasury Rate')
maturity <- c('3m', '6m', '1y', '2y','3y','5y','7y','10y','20y','30y')
# by default R will sort these maturities alphabetically; but since we want
# to keep them in that exact order, we recast maturity as a factor
# or categorical variable, with the levels defined as we want
maturity <- factor(maturity, levels = maturity)
# Create a lookup dataset
mylookup<-data.frame(symbol=tickers,var=myvars, maturity=maturity)
# Take a look:
mylookup %>%
knitr::kable()
| symbol | var | maturity |
|---|---|---|
| TB3MS | 3-Month Treasury Bill | 3m |
| TB6MS | 6-Month Treasury Bill | 6m |
| GS1 | 1-Year Treasury Rate | 1y |
| GS2 | 2-Year Treasury Rate | 2y |
| GS3 | 3-Year Treasury Rate | 3y |
| GS5 | 5-Year Treasury Rate | 5y |
| GS7 | 7-Year Treasury Rate | 7y |
| GS10 | 10-Year Treasury Rate | 10y |
| GS20 | 20-Year Treasury Rate | 20y |
| GS30 | 30-Year Treasury Rate | 30y |
df <- tickers %>% tidyquant::tq_get(get="economic.data",
from="1960-01-01") # start from January 1960
glimpse(df)
## Rows: 6,774
## Columns: 3
## $ symbol <chr> "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS",…
## $ date <date> 1960-01-01, 1960-02-01, 1960-03-01, 1960-04-01, 1960-05-01, 1…
## $ price <dbl> 4.35, 3.96, 3.31, 3.23, 3.29, 2.46, 2.30, 2.30, 2.48, 2.30, 2.…
This dataframe df has three columns (variables):
- symbol: the FRED database ticker symbol
- date: already a date object
- price: the actual yield on that date
# Join this dataframe 'df' with dataframe 'mylookup' to have a more readable version
yield_curve <- left_join(df, mylookup, by="symbol")
yield_curve %>%
mutate(var=factor(var, levels=c('3-Month Treasury Bill','6-Month Treasury Bill','1-Year Treasury Rate','2-Year Treasury Rate',"3-Year Treasury Rate","5-Year Treasury Rate","7-Year Treasury Rate","10-Year Treasury Rate","20-Year Treasury Rate","30-Year Treasury Rate"))) %>%
#plotting dates on x-axis and yield on y-axis with color represented by maturity
ggplot(mapping = aes(x=date, y=price, color = var), show.legend = FALSE) +
geom_line(show.legend = FALSE) +
#to facet by the maturity and have 2 columns for the graphs
facet_wrap(~var,ncol=2) +
#titles for the plot
labs(title = "Yields on US Treasury rates since 1960",
x = "",
y = "%",
caption = "Source: St. Louis Federal Reserve Economic Database (FRED)") +
#adding a theme
theme_bw() +
#no legend
theme(legend.position = "none")
yield_curve %>%
#filter to show treasuries in and after 1999
filter(year(date)>=1999) %>%
#plotting maturity on x axis and yield on y axis, and it is grouped by month colored by year
ggplot(aes(
x = maturity,
y = price,
group = month(date),
color = as.factor(year(date)))) +
geom_line() +
#faceted by year with graphs in 4 columns
facet_wrap(~ year(date), ncol = 4) +
theme_bw() +
#labelling the axes and titles
labs(title = "US Yield Curve",
x = "Maturity",
y = "Yield (%)",
caption = "Source: St. Louis Federal Reserve Economic Database (FRED)") +
#no legend
theme(legend.position = "none",
axis.text.x=element_text(size=7))
yield_curve %>%
#filtering to have treasuries in or after 1999 and only 3 month and 10 year bills and rates
filter(year(date) >= 1999,
maturity %in% c("3m","10y")) %>% # use %in% rather than ==, which would recycle the vector and drop rows
#plotting with years on x-axis and yields on y-axis colored according to maturities
ggplot(mapping =
aes(x = date,
y = price,
group = var,
color = var)) +
geom_line() +
#to change position of the items in the legend
scale_color_discrete(breaks = c("3-Month Treasury Bill",
"10-Year Treasury Rate")) +
#labelling the axes and title
labs(title = "Yields on 3-Month and 10-Year Treasury Rates since 1999",
x = "",
y = "%",
caption = "Source: St. Louis Federal Reserve Economic Database (FRED)") +
theme_bw() +
#no title for the legend
theme(legend.title = element_blank())
According to Wikipedia’s list of recessions in the United States, since 1999 there have been two recessions in the US: between Mar 2001–Nov 2001 and between Dec 2007–June 2009. Does the yield curve seem to flatten before these recessions?
Yes, it seems that the yield curve flattens, and then inverts, before both of these recessions.
Can a yield curve flattening really mean a recession is coming in the US? According to the following graph, where the yield curve flattened, a recession followed. Though it cannot be taken as the only factor, it should be treated as a warning signal.
Since 1999, when did short-term (3-month) debt yield more than longer-term (10-year) debt?
The following graph shows nine occasions on which this happened; we calculate the spread (10-year minus 3-month) and label it ‘diff’.
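The count of inversion episodes can also be sanity-checked directly from the spread rather than read off the plot. A minimal sketch, assuming the `g1` dataframe with its `diff` column (10-year minus 3-month) as constructed in this section:

```r
# Hedged sketch: count separate inversion episodes (consecutive runs of diff < 0)
# Assumes g1 has columns `date` and `diff` as built in this section
library(dplyr)

spread <- g1 %>%
  arrange(date) %>%
  pull(diff)

runs <- rle(spread < 0)                    # run-length encode the inversion flag
sum(runs$values, na.rm = TRUE)             # number of distinct inversion episodes
```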
# get US recession dates after 1946 from Wikipedia
# https://en.wikipedia.org/wiki/List_of_recessions_in_the_United_States
#creates a dataframe with all US recessions since 1946
recessions <- tibble(
from = c("1948-11-01", "1953-07-01", "1957-08-01", "1960-04-01", "1969-12-01", "1973-11-01", "1980-01-01","1981-07-01", "1990-07-01", "2001-03-01", "2007-12-01"),
to = c("1949-10-01", "1954-05-01", "1958-04-01", "1961-02-01", "1970-11-01", "1975-03-01", "1980-07-01", "1982-11-01", "1991-03-01", "2001-11-01", "2009-06-01")
) %>%
mutate(From = ymd(from),
To=ymd(to),
duration_days = To-From)
recessions

## # A tibble: 11 x 5
## from to From To duration_days
## <chr> <chr> <date> <date> <drtn>
## 1 1948-11-01 1949-10-01 1948-11-01 1949-10-01 334 days
## 2 1953-07-01 1954-05-01 1953-07-01 1954-05-01 304 days
## 3 1957-08-01 1958-04-01 1957-08-01 1958-04-01 243 days
## 4 1960-04-01 1961-02-01 1960-04-01 1961-02-01 306 days
## 5 1969-12-01 1970-11-01 1969-12-01 1970-11-01 335 days
## 6 1973-11-01 1975-03-01 1973-11-01 1975-03-01 485 days
## 7 1980-01-01 1980-07-01 1980-01-01 1980-07-01 182 days
## 8 1981-07-01 1982-11-01 1981-07-01 1982-11-01 488 days
## 9 1990-07-01 1991-03-01 1990-07-01 1991-03-01 243 days
## 10 2001-03-01 2001-11-01 2001-03-01 2001-11-01 245 days
## 11 2007-12-01 2009-06-01 2007-12-01 2009-06-01 548 days
#creating a new dataframe called g1
g1 <- yield_curve %>%
#choosing date, price, and symbol from existing dataframe
select(date,price,symbol) %>%
#a pivot wider to take values from price column and names from symbol column
pivot_wider(names_from = symbol, values_from = price) %>%
#create a new column called diff that calculates the spread
mutate(diff = GS10 - TB3MS)
g1 %>%
#plotting the graph using spread in y-axis and year in x-axis
ggplot(aes(x = date, y = diff)) +
#creating grey boxes using dataframe recessions to indicate the periods of recessions
geom_rect(data = recessions,
inherit.aes = FALSE,
aes(xmin = From,
xmax = To,
ymin = -Inf,
ymax = Inf),
alpha = 1,
fill = "grey")+
geom_line() +
#creating a line where y=0
geom_hline(yintercept = 0) +
#indicating color of graph (blue when it is above x-axis and red when it is below)
geom_ribbon(aes(ymin = 0,
ymax = pmax(diff, 0)),
fill = "#A6BDDB", alpha = 0.5) +
geom_ribbon(aes(ymin = pmin(diff, 0),
ymax = 0),
fill = "#FC787E", alpha = 0.5) +
#creating the lines at the bottom and filling the appropriate color
geom_rug(sides = "b", colour = case_when(g1$diff >= 0 ~ "#A6BDDB",
g1$diff < 0 ~ "#FC787E")) +
#changing the x-axis scale
scale_x_date(date_labels="%Y", date_breaks = "2 years", limits= c(as.Date("1959-01-01"), as.Date("2023-01-01"))) +
#labelling the axes and title
labs(title = "Yield Curve Inversion 10-Year minus 3-Month US Treasury Rates",
subtitle = "Difference in % points, monthly averages.\nShaded areas correspond to recessions",
x = "",
y = "Difference (10-Year - 3-Month) Yield in %",
caption = "Source: FRED, Federal Reserve Bank of St. Louis") +
theme_minimal() +
#no legend
theme(axis.text.x=element_text(size=5),
legend.position = "none",
        plot.title=element_text(face="bold"), #making the title bold
        plot.subtitle=element_text(face="italic", size = 8) #making the subtitle italic
  )

The main components of gross domestic product (GDP) are personal consumption (C), business investment (I), government spending (G) and net exports (exports - imports). The GDP data we looked at is from the United Nations’ National Accounts Main Aggregates Database. We looked at how GDP and its components have changed over time, and compared different countries and how much each component contributes to that country’s GDP. The file we worked with is GDP and its breakdown at constant 2010 prices in US Dollars.
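As a reminder of the accounting identity the analysis below relies on, GDP = C + I + G + (exports - imports). A toy illustration with made-up numbers (hypothetical, in $bn):

```r
# Toy numbers (hypothetical, $bn) illustrating the GDP identity
C <- 13000; I <- 3500; G <- 3000
exports <- 2500; imports <- 3100

GDP <- C + I + G + (exports - imports)
GDP  # 18900
```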
UN_GDP_data <- read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
sheet="Download-GDPconstant-USD-countr", # Sheet name
                skip=2) # Number of rows to skip

We first tidied the data, and converted it from wide format to long, tidy format. We expressed all figures in billions and renamed the indicators into something shorter.
tidy_GDP_data <- UN_GDP_data %>%
  #Convert wide format to long format
pivot_longer(cols = 4:51,
names_to = "year",
values_to = "value") %>%
#Divide GDP value by 10^9, and create a column with shortened indicator names, respectively.
mutate(value = value/1e9,
year = as.numeric(year),
indicator = case_when(
IndicatorName == "Household consumption expenditure (including Non-profit institutions serving households)" ~ "C",
IndicatorName == "Gross Domestic Product (GDP)" ~ "GDP",
IndicatorName == "Imports of goods and services" ~ "Imports",
IndicatorName == "Exports of goods and services" ~ "Exports",
IndicatorName == "General government final consumption expenditure" ~ "G",
IndicatorName == "Gross capital formation" ~ "I"
)) %>%
#Remove country_id and indicator_name
select(-1,-3) %>%
filter(!is.na(indicator)) %>%
clean_names()
glimpse(tidy_GDP_data)

## Rows: 63,072
## Columns: 4
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",…
## $ year <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,…
## $ value <dbl> 5.07, 4.84, 4.70, 5.21, 5.59, 5.65, 5.68, 6.15, 6.30, 6.15,…
## $ indicator <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",…
# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")

tidy_GDP_data %>%
#Choose only the countries we are interested in
  filter(country %in% country_list, indicator != "GDP") %>%
#Plot the lines of indicators we are interested in
ggplot(mapping = aes(x=year, y=value, group = indicator, color = indicator)) +
geom_line() +
#Draw the plot for each country
facet_wrap(~country)+
#Set mark points for x and y axes
scale_y_continuous(breaks = seq(0, 12500, 2500))+
scale_x_continuous(breaks = c(1970, 1980, 1990, 2000, 2010))+
scale_color_discrete(name="Components of GDP",
breaks=c("I", "Exports", "G", "C", "Imports"),
labels=c("Gross Capital formation", "Exports", "Government expenditure", "Household expenditure", "Imports"))+
theme_bw()+
coord_fixed(ratio = 0.007)+
theme(plot.margin=grid::unit(c(0,0,0,0), "mm"),
legend.margin=unit(.2,"inches"))+
labs(x = "",
y= "Billion US$",
title = "GDP components over time",
subtitle = "In constant 2010 USD")+
  NULL

GDP is the sum of Household Expenditure (Consumption, C), Gross Capital Formation (business investment, I), Government Expenditure (G) and Net Exports (exports - imports). Even though there is a Gross Domestic Product (GDP) indicator in the dataframe, we calculated GDP from these components.
tidy_GDP_data %>%
#Choose the countries we are interested in
filter(country %in% country_list) %>%
pivot_wider(values_from = "value",
names_from = "indicator") %>%
clean_names() %>%
  #Create columns for the share of Household Expenditure (C), Gross Capital Formation (I), Government Expenditure (G) and Net Exports (exports - imports) in calculated GDP, plus the percentage error between calculated and reported GDP
mutate(calculated_gdp = c+g+i+exports-imports,
c_pro = c/calculated_gdp,
i_pro = i/calculated_gdp,
g_pro = g/calculated_gdp,
netex_pro = (exports - imports)/calculated_gdp,
percent_error = 1-calculated_gdp/gdp) %>%
#plot the items mentioned above, separated by countries, set mark points and end points
select("country", "year", "c_pro", "i_pro", "g_pro", "netex_pro", "percent_error") %>%
pivot_longer(cols = 3:7,
names_to = "indicator",
values_to = "value") %>%
ggplot(mapping = aes(x=year, y=value, group = indicator, color = indicator)) +
geom_line() +
facet_wrap(~country)+
scale_y_continuous(limits = c(-0.2, 0.75), breaks = seq(-0.2, 0.6, 0.2), labels = scales::percent)+
scale_x_continuous(breaks = c(1970, 1980, 1990, 2000, 2010))+
scale_color_discrete(name="",
breaks=c("g_pro", "i_pro", "c_pro", "netex_pro","percent_error"),
labels=c("Government expenditure", "Gross Capital formation", "Household expenditure", "Net exports", "Percentage error between reported and caculated GDP"))+
theme_bw()+
labs(title = "GDP and its breakdown at constant 2010 prices in US dollars",
x= "",
y="proportion") +
coord_fixed(ratio = 60)+
theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "inches"),
        legend.margin=unit(.2,"inches"))

We added a line to show the percentage error between reported GDP and calculated GDP, and observed from the plots that the error ranges from 0 to 5%.
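The 0–5% band can also be checked numerically rather than read off the chart. A sketch, assuming `tidy_GDP_data` and `country_list` as defined above:

```r
# Hedged sketch: summarise the percentage error between reported and
# calculated GDP for the three countries (assumes objects defined above)
tidy_GDP_data %>%
  filter(country %in% country_list) %>%
  pivot_wider(names_from = "indicator", values_from = "value") %>%
  clean_names() %>%
  mutate(percent_error = 1 - (c + g + i + exports - imports) / gdp) %>%
  summarise(min_error = min(percent_error, na.rm = TRUE),
            max_error = max(percent_error, na.rm = TRUE))
```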
The last chart also tells us a few things:
1) Germany’s share of net exports in GDP increased over 2000-2017, while India and the USA mostly ran negative net exports. One possible reason is that Germany is a leader in industrials, producing higher-value products, so its net exports were positive.
2) India and the USA have a higher share of household expenditure in GDP than Germany, with India’s decreasing and the USA’s increasing. The USA’s high share of household expenditure points to strong domestic market demand, which may help explain why the USA and India have grown faster than Germany.
3) India and Germany have a higher share of gross capital formation in GDP than the USA, with India’s growing significantly over 2000-2017. One possible explanation is that India attracted substantial business investment over this period thanks to labour costs lower than those of Germany and the USA.
4) The USA and Germany have a higher share of government expenditure in GDP than India. This might mean that the governments of the USA and Germany are in a better financial position than India’s.